2025-05-09-12-03
Enigme: Generative Text Puzzles for Evaluating Reasoning in Language Models
Abstract
arXiv:2505.04914v1 Announce Type: new Abstract: Transformer-decoder language models are a core innovation in text based generative artificial intelligence. These models are being deployed as general-purpose intelligence systems in many applications. Central to their utility is the capacity to understand natural language commands and exploit the reasoning embedded in human text corpora to apply some form of reasoning process to a wide variety of novel tasks. To understand the limitations of this approach to generating reasoning we argue that we need to consider the architectural constraints of these systems. Consideration of the latent variable structure of transformer-decoder models allows us to design reasoning tasks that should probe the boundary of their capacity to reason. We present enigme, an open-source library for generating text-based puzzles to be used in training and evaluating reasoning skills within transformer-decoder models and future AI architectures.
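Summary
Language models based on the Transformer decoder are a core innovation in text-generating artificial intelligence. These models are being deployed as general-purpose intelligent systems across many application scenarios. Central to their function is the ability to understand natural language instructions and to draw on the reasoning contained in human text corpora in order to apply some form of reasoning process to all kinds of novel tasks. To understand the limitations of this approach to generating reasoning, we argue that the architectural constraints of these systems must be examined. By analyzing the latent variable structure of Transformer-decoder models, we are able to design test tasks that probe the boundary of their reasoning capacity. This paper presents Enigme, an open-source library for generating text puzzles, intended for training and evaluating the reasoning abilities of Transformer-decoder models and future AI architectures.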
Position: Epistemic Artificial Intelligence is Essential for Machine Learning Models to Know When They Do Not Know
Abstract
arXiv:2505.04950v1 Announce Type: new Abstract: Despite the impressive achievements of AI, including advancements in generative models and large language models, there remains a significant gap in the ability of AI to handle uncertainty and generalize beyond the training data. We argue that AI models, especially in autonomous systems, fail to make robust predictions when faced with unfamiliar or adversarial data, as evidenced by incidents with autonomous vehicles. Traditional machine learning approaches struggle to address these issues due to an overemphasis on data fitting and domain adaptation. This position paper posits a paradigm shift towards epistemic artificial intelligence, emphasizing the need for models to learn not only from what they know but also from their ignorance. This approach, which focuses on recognizing and managing uncertainty, offers a potential solution to improve the resilience and robustness of AI systems, ensuring that they can better handle unpredictable real-world environments.
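Summary
Despite the remarkable achievements of artificial intelligence, including progress in generative models and large language models, significant shortcomings remain in its handling of uncertainty and its ability to generalize beyond the training data. We argue that AI models, especially those in autonomous systems, cannot make robust predictions when faced with unfamiliar or adversarial data, as incidents involving self-driving cars attest. Traditional machine learning methods struggle to solve these problems because they overemphasize data fitting and domain adaptation. This position paper proposes a paradigm shift toward epistemic artificial intelligence, stressing that models need to learn not only from what is known but also from what is unknown. This approach, centered on recognizing and managing uncertainty, offers a potential solution for improving the resilience and robustness of AI systems, so that they can better cope with unpredictable real-world environments.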
Towards Artificial Intelligence Research Assistant for Expert-Involved Learning
Abstract
arXiv:2505.04638v1 Announce Type: new Abstract: Large Language Models (LLMs) and Large Multi-Modal Models (LMMs) have emerged as transformative tools in scientific research, yet their reliability and specific contributions to biomedical applications remain insufficiently characterized. In this study, we present ARtificial Intelligence research assistant for Expert-involved Learning (ARIEL), a multimodal dataset designed to benchmark and enhance two critical capabilities of LLMs and LMMs in biomedical research: summarizing extensive scientific texts and interpreting complex biomedical figures. To facilitate rigorous assessment, we create two open-source sets comprising biomedical articles and figures with designed questions. We systematically benchmark both open- and closed-source foundation models, incorporating expert-driven human evaluations conducted by doctoral-level experts. Furthermore, we improve model performance through targeted prompt engineering and fine-tuning strategies for summarizing research papers, and apply test-time computational scaling to enhance the reasoning capabilities of LMMs, achieving superior accuracy compared to human-expert corrections. We also explore the potential of using LMM Agents to generate scientific hypotheses from diverse multimodal inputs. Overall, our results delineate clear strengths and highlight significant limitations of current foundation models, providing actionable insights and guiding future advancements in deploying large-scale language and multi-modal models within biomedical research.
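Summary
Large Language Models (LLMs) and Large Multi-Modal Models (LMMs) have become transformative tools for scientific research, but their reliability and specific contributions in biomedical applications are still not well characterized. This study presents ARIEL (Artificial Intelligence research assistant for Expert-involved Learning), a multimodal dataset intended to evaluate and strengthen two key capabilities of LLMs and LMMs in biomedical research: summarizing long scientific texts and interpreting complex biomedical figures. To support rigorous evaluation, we create two open-source datasets containing biomedical articles and figures together with accompanying questions. We systematically benchmark open- and closed-source foundation models and include human evaluation led by doctoral-level experts. In addition, we improve model performance on research paper summarization through targeted prompt engineering and fine-tuning strategies, and we apply test-time compute scaling to strengthen the reasoning of LMMs, reaching accuracy that exceeds human-expert corrections. We also explore the potential of LMM agents to generate scientific hypotheses from multimodal inputs. Overall, the results identify the strengths of current foundation models while revealing significant limitations, and they offer actionable insights and directions for deploying large-scale language and multi-modal models in biomedical research.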
Large Language Models are Autonomous Cyber Defenders
Abstract
arXiv:2505.04843v1 Announce Type: new Abstract: Fast and effective incident response is essential to prevent adversarial cyberattacks. Autonomous Cyber Defense (ACD) aims to automate incident response through Artificial Intelligence (AI) agents that plan and execute actions. Most ACD approaches focus on single-agent scenarios and leverage Reinforcement Learning (RL). However, ACD RL-trained agents depend on costly training, and their reasoning is not always explainable or transferable. Large Language Models (LLMs) can address these concerns by providing explainable actions in general security contexts. Researchers have explored LLM agents for ACD but have not evaluated them on multi-agent scenarios or interacting with other ACD agents. In this paper, we show the first study on how LLMs perform in multi-agent ACD environments by proposing a new integration to the CybORG CAGE 4 environment. We examine how ACD teams of LLM and RL agents can interact by proposing a novel communication protocol. Our results highlight the strengths and weaknesses of LLMs and RL and help us identify promising research directions to create, train, and deploy future teams of ACD agents.
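Summary
Fast and effective incident response is essential for guarding against malicious cyberattacks. Autonomous Cyber Defense (ACD) aims to automate the response through artificial intelligence (AI) agents that plan and execute actions. Existing ACD methods mostly focus on single-agent scenarios and use reinforcement learning (RL), but RL-trained ACD agents suffer from high training costs, and their decision process lacks explainability and transferability. Large language models (LLMs) can address these problems by providing explainable actions in general security contexts. Although prior work has explored LLM agents for ACD, it has not evaluated how they perform in multi-agent scenarios or when interacting with other ACD agents. By proposing a new integration with the CybORG CAGE 4 environment, this paper presents the first study of LLM performance in multi-agent ACD environments. We design a new communication protocol and examine how ACD teams made up of LLM and RL agents can cooperate. The experimental results reveal the strengths and weaknesses of LLMs and RL and point to research directions for creating, training, and deploying future teams of ACD agents.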
Exploring Influence Factors on LLM Suitability for No-Code Development of End User IoT Applications
Abstract
arXiv:2505.04710v1 Announce Type: new Abstract: With the increasing popularity of IoT applications, end users demand more personalized and intuitive functionality. A major obstacle to this, however, is that custom IoT functionality today still requires at least some coding skills. To address this, no-code development platforms have been proposed as a solution for empowering non-technical users to create applications. However, such platforms still require a certain level of technical expertise for structuring process steps or defining event-action relations. The advent of LLMs can further enhance no-code platforms by enabling natural language-based interaction, automation of complex tasks, and dynamic code generation. By allowing users to describe their requirements in natural language, LLMs can significantly streamline no-code development. As LLMs vary in performance, architecture, training data used, and the use cases they target, it is still unclear which models are best suited and which factors determine this fit. In particular, no-code development of IoT applications by non-technical users will place completely different demands on LLMs than, e.g., code generation for more open-ended applications or for supporting professional developers. In this paper, we explore the factors influencing the suitability of LLMs for no-code development of IoT applications. We also examine the role of input prompt language in the accuracy and quality of generated applications as well as the influence of LLM training data. By conducting comprehensive experiments with a range of LLMs, we provide valuable insights for optimizing LLM-powered no-code platforms, guiding the selection of suitable LLMs and their effective application. Our findings contribute to improving the accessibility, efficiency, and user experience of no-code IoT development, ultimately enabling broader adoption of IoT technologies among non-expert users.
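Summary
As IoT applications become more popular, end users increasingly demand personalized and intuitive functionality. However, custom IoT functionality today still requires at least some programming ability, which is a major obstacle. To address this, no-code development platforms have been proposed as a way to enable non-technical users to create applications. Such platforms nevertheless still require some technical expertise for structuring process steps or defining event-action relations. The arrival of large language models (LLMs) can further enhance no-code platforms by enabling natural language interaction, automation of complex tasks, and dynamic code generation. When users can describe their requirements in natural language, LLMs can significantly simplify no-code development. Because LLMs differ in performance, architecture, training data, and target use cases, it remains unclear which models are most suitable and which factors determine the fit. In particular, no-code development of IoT applications by non-technical users places fundamentally different demands on LLMs than code generation for open-ended applications or assistance for professional developers. This paper investigates the key factors that affect the suitability of LLMs for no-code IoT development, and studies the effect of input prompt language on the accuracy and quality of generated applications as well as the influence of LLM training data. Through comprehensive experiments with a range of LLMs, we provide important insights for optimizing LLM-based no-code platforms and guide the selection of suitable LLMs and their effective use. This work helps improve the accessibility, efficiency, and user experience of no-code IoT development, and ultimately promotes broader adoption of IoT technologies among non-expert users.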
Text2Cypher: Data Pruning using Hard Example Selection
Abstract
arXiv:2505.05122v1 Announce Type: new Abstract: Database query languages such as SQL for relational databases and Cypher for graph databases have been widely adopted. Recent advancements in large language models (LLMs) enable natural language interactions with databases through models like Text2SQL and Text2Cypher. Fine-tuning these models typically requires large, diverse datasets containing non-trivial examples. However, as dataset size increases, the cost of fine-tuning also rises. This makes smaller, high-quality datasets essential for reducing costs for the same or better performance. In this paper, we propose five hard-example selection techniques for pruning the Text2Cypher dataset, aiming to preserve or improve performance while reducing resource usage. Our results show that these hard-example selection approaches can halve training time and costs with minimal impact on performance, demonstrating that hard-example selection provides a cost-effective solution.
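Summary
Query languages such as SQL for relational databases and Cypher for graph databases have been widely adopted. Recent progress in large language models (LLMs) makes natural language interaction with databases possible through models such as Text2SQL and Text2Cypher. Fine-tuning these models usually requires large, diverse datasets containing non-trivial examples. However, as the dataset grows, the cost of fine-tuning rises with it. This makes small, high-quality datasets essential for lowering cost while maintaining or improving performance. This paper proposes five hard-example selection techniques for pruning the Text2Cypher dataset, with the aim of preserving or improving performance while reducing resource usage. The experimental results show that these hard-example selection methods can halve training time and cost with very little impact on performance, demonstrating that hard-example selection is a cost-effective solution.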
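The abstract above does not name its five selection techniques, so the following is only a minimal sketch of the general idea of hard-example pruning. It assumes a hypothetical hardness proxy (the number of Cypher clause keywords in the reference query) and a hypothetical dataset layout of `question`/`cypher` dictionaries; neither is taken from the paper.

```python
import re

# Hypothetical hardness proxy: number of Cypher clause keywords in the reference
# query. This is an illustrative assumption, not one of the paper's techniques.
CLAUSE_PATTERN = re.compile(
    r"\b(MATCH|WHERE|WITH|RETURN|ORDER BY|UNWIND|CALL|LIMIT)\b", re.IGNORECASE
)


def hardness(example):
    """Score an example by how many clause keywords its Cypher query contains."""
    return len(CLAUSE_PATTERN.findall(example["cypher"]))


def prune_to_hard_examples(dataset, keep_fraction=0.5):
    """Keep the hardest `keep_fraction` of the dataset (at least one example)."""
    keep = max(1, int(len(dataset) * keep_fraction))
    ranked = sorted(dataset, key=hardness, reverse=True)
    return ranked[:keep]


toy_dataset = [
    {"question": "List all people.",
     "cypher": "MATCH (p:Person) RETURN p"},
    {"question": "Who acted in The Matrix?",
     "cypher": "MATCH (p:Person)-[:ACTED_IN]->(m:Movie) "
               "WHERE m.title = 'The Matrix' RETURN p.name"},
    {"question": "Top 3 actors by movie count?",
     "cypher": "MATCH (p:Person)-[:ACTED_IN]->(m:Movie) WITH p, count(m) AS n "
               "RETURN p.name, n ORDER BY n DESC LIMIT 3"},
    {"question": "Count movies.",
     "cypher": "MATCH (m:Movie) RETURN count(m)"},
]

# Halving the toy dataset keeps the two queries with the most clauses.
pruned = prune_to_hard_examples(toy_dataset, keep_fraction=0.5)
```

With `keep_fraction=0.5` the fine-tuning set is half the original size, which mirrors the halved training cost the abstract reports; whether performance holds depends on how well the chosen hardness score reflects real difficulty.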
The Promise and Limits of LLMs in Constructing Proofs and Hints for Logic Problems in Intelligent Tutoring Systems
Abstract
arXiv:2505.04736v1 Announce Type: new Abstract: Intelligent tutoring systems have demonstrated effectiveness in teaching formal propositional logic proofs, but their reliance on template-based explanations limits their ability to provide personalized student feedback. While large language models (LLMs) offer promising capabilities for dynamic feedback generation, they risk producing hallucinations or pedagogically unsound explanations. We evaluated the stepwise accuracy of LLMs in constructing multi-step symbolic logic proofs, comparing six prompting techniques across four state-of-the-art LLMs on 358 propositional logic problems. Results show that DeepSeek-V3 achieved superior performance with 84.4% accuracy on stepwise proof construction and excelled particularly in simpler rules. We further used the best-performing LLM to generate explanatory hints for 1,050 unique student problem-solving states from a logic ITS and evaluated them on 4 criteria with both an LLM grader and human expert ratings on a 20% sample. Our analysis finds that LLM-generated hints were 75% accurate and rated highly by human evaluators on consistency and clarity, but did not perform as well at explaining why the hint was provided or its larger context. Our results demonstrate that LLMs may be used to augment tutoring systems with logic tutoring hints, but require additional modifications to ensure accuracy and pedagogical appropriateness.
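Summary
Intelligent tutoring systems have proven effective at teaching formal propositional logic proofs, but their template-based explanations limit their ability to give personalized student feedback. Although large language models (LLMs) offer promising capabilities for dynamic feedback generation, they risk producing hallucinations or pedagogically unsound explanations. We evaluated the stepwise accuracy of LLMs in constructing multi-step symbolic logic proofs, comparing six prompting techniques across four state-of-the-art LLMs on 358 propositional logic problems. The results show that DeepSeek-V3 performed best, with 84.4% accuracy on stepwise proof construction, and did particularly well on simpler rules. We then used the best-performing LLM to generate explanatory hints for 1,050 unique student problem-solving states from a logic intelligent tutoring system, and evaluated them on 4 criteria using an LLM grader and human expert ratings on a 20% sample. The analysis finds that the LLM-generated hints were 75% accurate and were rated highly by human evaluators for consistency and clarity, but performed less well at explaining why the hint was given and its larger context. Our results indicate that LLMs can be used to augment tutoring systems with logic hints, but that further modifications are needed to ensure accuracy and pedagogical appropriateness.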
Enhancing Text2Cypher with Schema Filtering
Abstract
arXiv:2505.05118v1 Announce Type: new Abstract: Knowledge graphs represent complex data using nodes, relationships, and properties. Cypher, a powerful query language for graph databases, enables efficient modeling and querying. Recent advancements in large language models allow translation of natural language questions into Cypher queries (Text2Cypher). A common approach is incorporating the database schema into prompts. However, complex schemas can introduce noise, increase hallucinations, and raise computational costs. Schema filtering addresses these challenges by including only relevant schema elements, improving query generation while reducing token costs. This work explores various schema filtering methods for the Text2Cypher task and analyzes their impact on token length, performance, and cost. Results show that schema filtering effectively optimizes Text2Cypher, especially for smaller models. Consistent with prior research, we find that larger models benefit less from schema filtering due to their longer context capabilities. However, schema filtering remains valuable for both larger and smaller models in cost reduction.
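Summary
Knowledge graphs represent complex data through nodes, relationships, and properties. Cypher, a powerful query language for graph databases, enables efficient data modeling and querying. With the development of large language models, translating natural language questions into Cypher queries (Text2Cypher) has become possible. The prevailing approach is to include the database schema in the prompt, but complex schemas can introduce noise, worsen hallucinations, and raise computational cost. Schema filtering addresses these problems by retaining only the relevant schema elements, improving the quality of query generation while lowering token costs. This study examines different schema filtering methods for the Text2Cypher task and analyzes their effect on token length, performance, and cost. The experimental results show that schema filtering effectively optimizes the Text2Cypher task, with a particularly notable effect for smaller models. In line with prior research, we find that larger models benefit less from schema filtering because of their long-context capabilities. Even so, schema filtering remains valuable for reducing the cost of using models of all sizes.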
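The abstract above does not say which filtering methods were compared. As a rough illustration of the idea only, the sketch below keeps a node label when the label or one of its properties appears as a word in the question, and keeps a relationship only when both of its endpoints survive. The schema layout and the keyword-matching rule are assumptions made for this example.

```python
import re


def tokenize(text):
    """Return the set of lowercase alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def filter_schema(schema, question):
    """Keep only schema elements that the question appears to mention.

    `schema` uses a hypothetical layout:
      {"nodes": {label: [property, ...]},
       "relationships": [(start_label, rel_type, end_label), ...]}
    """
    words = tokenize(question)
    kept_nodes = {
        label: props
        for label, props in schema["nodes"].items()
        if label.lower() in words or any(p.lower() in words for p in props)
    }
    kept_rels = [
        (start, rel, end)
        for start, rel, end in schema["relationships"]
        if start in kept_nodes and end in kept_nodes
    ]
    return {"nodes": kept_nodes, "relationships": kept_rels}


full_schema = {
    "nodes": {
        "Person": ["name", "born"],
        "Movie": ["title", "released"],
        "Genre": ["label"],
        "Studio": ["city"],
    },
    "relationships": [
        ("Person", "ACTED_IN", "Movie"),
        ("Movie", "IN_GENRE", "Genre"),
        ("Studio", "PRODUCED", "Movie"),
    ],
}

filtered = filter_schema(
    full_schema, "Which person acted in the movie with title Inception?"
)
```

Exact word matching is deliberately naive: a question that says "movies" or "film" would miss the `Movie` label, which is one reason a practical filter might rely on stemming or embedding similarity instead.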
ChemRxivQuest: A Curated Chemistry Question-Answer Database Extracted from ChemRxiv Preprints
Abstract
arXiv:2505.05232v1 Announce Type: new Abstract: The rapid expansion of chemistry literature poses significant challenges for researchers seeking to efficiently access domain-specific knowledge. To support advancements in chemistry-focused natural language processing (NLP), we present ChemRxivQuest, a curated dataset of 970 high-quality question-answer (QA) pairs derived from 155 ChemRxiv preprints across 17 subfields of chemistry. Each QA pair is explicitly linked to its source text segment to ensure traceability and contextual accuracy. ChemRxivQuest was constructed using an automated pipeline that combines optical character recognition (OCR), GPT-4o-based QA generation, and a fuzzy matching technique for answer verification. The dataset emphasizes conceptual, mechanistic, applied, and experimental questions, enabling applications in retrieval-based QA systems, search engine development, and fine-tuning of domain-adapted large language models. We analyze the dataset's structure, coverage, and limitations, and outline future directions for expansion and expert validation. ChemRxivQuest provides a foundational resource for chemistry NLP research, education, and tool development.
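Summary
The rapid growth of the chemistry literature poses a major challenge for researchers who need efficient access to domain-specific knowledge. To support progress in natural language processing (NLP) for chemistry, we introduce ChemRxivQuest, a curated dataset of 970 high-quality question-answer (QA) pairs extracted from 155 ChemRxiv preprints across 17 subfields of chemistry. Each QA pair is explicitly linked to its source text segment, ensuring traceability and contextual accuracy. The dataset was built with an automated pipeline that combines optical character recognition (OCR), GPT-4o-based QA generation, and fuzzy matching for answer verification. It focuses on conceptual, mechanistic, applied, and experimental questions, and can be used for retrieval-based QA systems, search engine development, and fine-tuning of domain-adapted large language models. We analyze the structure, coverage, and limitations of the dataset and outline directions for future expansion and expert validation. ChemRxivQuest provides a foundational resource for chemistry NLP research, education, and tool development.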
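The abstract above mentions a fuzzy matching step for answer verification without describing it. One simple way such a check could work is sketched here using Python's standard `difflib.SequenceMatcher`: the answer is compared against every same-length window of words in the source segment, and it counts as verified if the best similarity reaches a threshold. The windowing scheme and the 0.8 threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher


def best_fuzzy_match(answer, source_text):
    """Highest similarity ratio between the answer and any window of the
    source text that has the same number of words as the answer."""
    answer_words = answer.lower().split()
    source_words = source_text.lower().split()
    window = len(answer_words)
    if window == 0 or len(source_words) < window:
        # Fall back to comparing the two strings as a whole.
        return SequenceMatcher(None, answer.lower(), source_text.lower()).ratio()
    normalized_answer = " ".join(answer_words)
    best = 0.0
    for start in range(len(source_words) - window + 1):
        candidate = " ".join(source_words[start:start + window])
        ratio = SequenceMatcher(None, normalized_answer, candidate).ratio()
        best = max(best, ratio)
    return best


def is_verified(answer, source_text, threshold=0.8):
    """Accept the answer if some span of the source is similar enough to it."""
    return best_fuzzy_match(answer, source_text) >= threshold


source = ("The catalyst lowers the activation energy of the reaction "
          "without being consumed.")
supported = is_verified("lowers the activation energy of the reaction", source)
unsupported = is_verified("raises the boiling point of water", source)
```

A character-level ratio tolerates OCR noise such as a dropped letter, which matters when the source text comes from scanned preprints; the threshold trades off between rejecting paraphrased but correct answers and accepting unsupported ones.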
MARK: Memory Augmented Refinement of Knowledge
Abstract
arXiv:2505.05177v1 Announce Type: new Abstract: Large Language Models (LLMs) assist in specialized tasks but struggle to align with evolving domain knowledge without costly fine-tuning. Domain knowledge consists of: Knowledge: Immutable facts (e.g., 'A stone is solid') and generally accepted principles (e.g., ethical standards); Refined Memory: Evolving insights shaped by business needs and real-world changes. However, a significant gap often exists between a domain expert's deep, nuanced understanding and the system's domain knowledge, which can hinder accurate information retrieval and application. Our Memory-Augmented Refinement of Knowledge (MARK) framework enables LLMs to continuously learn without retraining by leveraging structured refined memory, inspired by the Society of Mind. MARK operates through specialized agents, each serving a distinct role: Residual Refined Memory Agent: Stores and retrieves domain-specific insights to maintain context over time; User Question Refined Memory Agent: Captures user-provided facts, abbreviations, and terminology for better comprehension; LLM Response Refined Memory Agent: Extracts key elements from responses for refinement and personalization. These agents analyse stored refined memory, detect patterns, resolve contradictions, and improve response accuracy. Temporal factors like recency and frequency prioritize relevant information while discarding outdated insights. MARK enhances LLMs in multiple ways: Ground Truth Strategy: Reduces hallucinations by establishing a structured reference; Domain-Specific Adaptation: Essential for fields like healthcare, law, and manufacturing, where proprietary insights are absent from public datasets; Personalized AI Assistants: Improves virtual assistants by remembering user preferences, ensuring coherent responses over time.
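Summary
Large Language Models (LLMs) can assist with specialized tasks, but without costly fine-tuning they struggle to keep up with evolving domain knowledge. Domain knowledge has two parts: Knowledge: immutable facts (e.g., "A stone is solid") and generally accepted principles (e.g., ethical standards); Refined Memory: evolving insights shaped by business needs and real-world changes. However, a significant gap often exists between a domain expert's deep, nuanced understanding and the system's domain knowledge, which can hinder accurate information retrieval and application. Inspired by the Society of Mind, our Memory-Augmented Refinement of Knowledge (MARK) framework enables LLMs to learn continuously without retraining. MARK operates through specialized agents, each with a distinct role: Residual Refined Memory Agent: stores and retrieves domain-specific insights to maintain long-term context; User Question Refined Memory Agent: captures user-provided facts, abbreviations, and terminology to improve comprehension; LLM Response Refined Memory Agent: extracts key elements from responses for refinement and personalization. These agents analyze the stored refined memory, detect patterns, resolve contradictions, and improve response accuracy. Temporal factors such as recency and frequency are used to prioritize relevant information while discarding outdated insights. MARK enhances LLMs in several ways: Ground Truth Strategy: reduces hallucinations by establishing a structured reference; Domain-Specific Adaptation: essential for fields such as healthcare, law, and manufacturing, where proprietary insights are absent from public data; Personalized AI Assistants: improves virtual assistants by remembering user preferences, ensuring coherent responses over time.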
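The abstract above says that recency and frequency prioritize relevant memories while outdated ones are discarded, but gives no formula. The sketch below shows one plausible scoring rule, log-scaled frequency multiplied by an exponential recency decay; the `MemoryItem` fields, the 30-day half-life, and the cutoff are all hypothetical choices, not the MARK implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    last_used_day: int  # day index when the item was last confirmed or retrieved
    use_count: int      # how often the item has been confirmed or retrieved


def relevance(item, today, half_life_days=30.0):
    """Log-scaled frequency weighted by exponential recency decay (assumed formula)."""
    age = max(0, today - item.last_used_day)
    decay = 0.5 ** (age / half_life_days)
    return math.log1p(item.use_count) * decay


def prioritize(memory, today, min_score=0.1):
    """Rank items by relevance and discard those scoring below `min_score`."""
    scored = [(relevance(m, today), m) for m in memory]
    kept = [(score, m) for score, m in scored if score >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in kept]


memory = [
    MemoryItem("Old rule: invoices are approved by fax", last_used_day=0, use_count=1),
    MemoryItem("'PO' means purchase order", last_used_day=40, use_count=10),
    MemoryItem("Q3 discount cap is 15%", last_used_day=99, use_count=3),
]

# On day 100 the fresh insight ranks first and the stale rule is discarded.
ranked = prioritize(memory, today=100)
```

The logarithm keeps a very frequently used item from dominating forever, while the half-life controls how quickly unused insights fade.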
Multi-agent Embodied AI: Advances and Future Directions
Abstract
arXiv:2505.05108v1 Announce Type: new Abstract: Embodied artificial intelligence (Embodied AI) plays a pivotal role in the application of advanced technologies in the intelligent era, where AI systems are integrated with physical bodies that enable them to perceive, reason, and interact with their environments. Through the use of sensors for input and actuators for action, these systems can learn and adapt based on real-world feedback, allowing them to perform tasks effectively in dynamic and unpredictable environments. As techniques such as deep learning (DL), reinforcement learning (RL), and large language models (LLMs) mature, embodied AI has become a leading field in both academia and industry, with applications spanning robotics, healthcare, transportation, and manufacturing. However, most research has focused on single-agent systems that often assume static, closed environments, whereas real-world embodied AI must navigate far more complex scenarios. In such settings, agents must not only interact with their surroundings but also collaborate with other agents, necessitating sophisticated mechanisms for adaptation, real-time learning, and collaborative problem-solving. Despite increasing interest in multi-agent systems, existing research remains narrow in scope, often relying on simplified models that fail to capture the full complexity of dynamic, open environments for multi-agent embodied AI. Moreover, no comprehensive survey has systematically reviewed the advancements in this area. As embodied AI rapidly evolves, it is crucial to deepen our understanding of multi-agent embodied AI to address the challenges presented by real-world applications. To fill this gap and foster further development in the field, this paper reviews the current state of research, analyzes key contributions, and identifies challenges and future directions, providing insights to guide innovation and progress in this field.
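Summary
Embodied artificial intelligence (Embodied AI) plays a key role in the application of advanced technologies in the intelligent era: by combining AI systems with physical bodies, it enables them to perceive, reason, and interact with their environments. These systems use sensors for input and actuators for action, and they learn and adapt based on real-world feedback, which allows them to carry out tasks efficiently in dynamic and unpredictable environments. As techniques such as deep learning (DL), reinforcement learning (RL), and large language models (LLMs) mature, embodied AI has become a frontier field in both academia and industry, with applications spanning robotics, healthcare, transportation, and manufacturing. However, existing research mostly concentrates on single-agent systems that assume static, closed environments, whereas real-world embodied AI must handle far more complex scenarios. In such scenarios, agents must not only interact with their environment but also collaborate with other agents, which requires advanced capabilities such as adaptation mechanisms, real-time learning, and collaborative problem-solving. Although multi-agent systems are attracting growing attention, existing work remains confined to simplified models that fail to capture the full complexity of multi-agent embodied AI in dynamic, open environments. Moreover, no systematic survey has yet comprehensively reviewed progress in this area. As embodied AI develops rapidly, a deeper understanding of multi-agent embodied AI is essential for meeting the challenges of real-world applications. To fill this gap and advance the field, this paper reviews the current state of research, analyzes key contributions, and identifies challenges and future directions, with the aim of providing guiding insights for innovation and progress in this field.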
CacheFL: Efficient Federated Cache Model Fine-Tuning for Vision-Language Models
Abstract
arXiv:2505.05130v1 Announce Type: new Abstract: Large pre-trained Vision-Language Models (VLMs), such as Contrastive Language-Image Pre-training (CLIP), have exhibited remarkable zero-shot performance across various image classification tasks. Fine-tuning these models on domain-specific datasets further enhances their effectiveness for downstream applications. However, fine-tuning in cloud environments raises significant concerns regarding data security and privacy. Federated Learning (FL) offers a decentralized solution by enabling model training across local clients without centralizing sensitive data, but the high communication and computation costs of transmitting full pre-trained models during training limit its scalability. Additionally, non-Independent and Identically Distributed (non-IID) data across local clients can negatively impact model convergence and performance. To address these challenges, we propose CacheFL, a novel federated learning method that replaces traditional full model fine-tuning with lightweight cache model fine-tuning. The cache model is initialized using a class-balanced dataset generated by a generative pre-trained model, effectively mitigating the impact of non-IID data. This cache model is then distributed to local clients for fine-tuning, and the updated parameters from each client are aggregated on the server and redistributed. With the updated cache model, the classification performance of CLIP is improved after just a few epochs. By limiting the training and communication to the cache model, CacheFL significantly reduces resource demands while ensuring data privacy and security. Extensive experiments conducted on ImageNet and 10 additional datasets demonstrate that CacheFL outperforms traditional approaches in terms of classification accuracy, resource efficiency, and privacy preservation.
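Summary
Large pre-trained Vision-Language Models (VLMs), such as Contrastive Language-Image Pre-training (CLIP), have shown excellent zero-shot performance across a variety of image classification tasks. Fine-tuning these models on domain-specific datasets further improves their effectiveness in downstream applications. However, fine-tuning in cloud environments raises serious concerns about data security and privacy. Federated Learning (FL) offers a decentralized solution by allowing models to be trained on local clients without centralizing sensitive data, but the high communication and computation costs of transmitting full pre-trained models during training limit its scalability. In addition, non-Independent and Identically Distributed (non-IID) data across local clients can harm model convergence and performance. To address these challenges, we propose CacheFL, a novel federated learning method that replaces traditional full-model fine-tuning with lightweight cache model fine-tuning. The cache model is initialized with a class-balanced dataset produced by a generative pre-trained model, which effectively mitigates the impact of non-IID data. The cache model is then distributed to local clients for fine-tuning, and the updated parameters from each client are aggregated on the server and redistributed. With the updated cache model, the classification performance of CLIP improves after only a few epochs. By restricting training and communication to the cache model, CacheFL significantly reduces resource demands while ensuring data privacy and security. Extensive experiments on ImageNet and 10 additional datasets show that CacheFL outperforms traditional approaches in classification accuracy, resource efficiency, and privacy preservation.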
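The abstract above states that the clients' updated cache-model parameters are aggregated on the server, but it does not specify the aggregation rule. The sketch below assumes the common sample-weighted average used by FedAvg-style methods, with each client's cache model represented as a flat list of floats; both choices are assumptions made for illustration.

```python
def aggregate_cache_params(client_updates):
    """Sample-weighted average of client parameter vectors (FedAvg-style).

    `client_updates` is a list of (num_samples, params) pairs, where `params`
    is a flat list of floats standing in for the cache model's weights.
    """
    total_samples = sum(n for n, _ in client_updates)
    dim = len(client_updates[0][1])
    averaged = [0.0] * dim
    for n, params in client_updates:
        weight = n / total_samples
        for i, value in enumerate(params):
            averaged[i] += weight * value
    return averaged


# Two clients with different data volumes send back updated cache weights.
updates = [
    (100, [1.0, 2.0]),
    (300, [3.0, 6.0]),
]
global_params = aggregate_cache_params(updates)
```

Only this small vector travels between clients and server in each round, which is the source of the communication savings the abstract describes; the frozen CLIP backbone never leaves the client.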
EcoAgent: An Efficient Edge-Cloud Collaborative Multi-Agent Framework for Mobile Automation
Abstract
arXiv:2505.05440v1 Announce Type: new Abstract: Cloud-based mobile agents powered by (multimodal) large language models ((M)LLMs) offer strong reasoning abilities but suffer from high latency and cost. While fine-tuned (M)SLMs enable edge deployment, they often lose general capabilities and struggle with complex tasks. To address this, we propose EcoAgent, an Edge-Cloud cOllaborative multi-agent framework for mobile automation. EcoAgent features a closed-loop collaboration among a cloud-based Planning Agent and two edge-based agents: the Execution Agent for action execution and the Observation Agent for verifying outcomes. The Observation Agent uses a Pre-Understanding Module to compress screen images into concise text, reducing token usage. In case of failure, the Planning Agent retrieves screen history and replans via a Reflection Module. Experiments on AndroidWorld show that EcoAgent maintains high task success rates while significantly reducing MLLM token consumption, enabling efficient and practical mobile automation.
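Summary
Cloud-based mobile agents powered by (multimodal) large language models ((M)LLMs) have strong reasoning abilities but suffer from high latency and high cost. Fine-tuned (M)SLMs make edge deployment possible, but they usually lose general capabilities and struggle with complex tasks. We therefore propose EcoAgent, an edge-cloud collaborative multi-agent framework for mobile automation. The framework forms a closed collaboration loop between a cloud-based Planning Agent and two edge-based agents: the Execution Agent, which carries out actions, and the Observation Agent, which verifies outcomes. The Observation Agent uses a Pre-Understanding Module to compress screen images into concise text, which significantly lowers token consumption. When a task fails, the Planning Agent retrieves the screen history and replans through a Reflection Module. Experiments on AndroidWorld show that EcoAgent maintains a high task success rate while greatly reducing LLM token consumption, achieving efficient and practical mobile automation.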
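The closed loop described in the abstract above (plan in the cloud, execute and verify at the edge, replan on failure) can be sketched as a small control loop. The three callables are hypothetical stand-ins for the Planning, Execution, and Observation agents, and the stub behaviors exist only to make the example runnable; none of this reflects the actual EcoAgent code.

```python
def run_task(goal, plan, execute, observe, max_replans=2):
    """Closed-loop control: plan, act, verify, and replan on failure.

    plan(goal, history)       -> list of (action, expected_outcome) steps
    execute(action)           -> short text description of the resulting screen
    observe(screen, expected) -> True when the outcome is verified
    """
    history = []  # compact text summaries of screens, reused when replanning
    for _ in range(max_replans + 1):
        for action, expected in plan(goal, history):
            screen = execute(action)
            history.append(screen)
            if not observe(screen, expected):
                break  # verification failed: return control to the planner
        else:
            return True, history  # every step was verified
    return False, history


# Stub agents for demonstration only.
def stub_plan(goal, history):
    if not history:  # first attempt picks the wrong button
        return [("tap wrong icon", "settings open")]
    return [("tap settings icon", "settings open")]


def stub_execute(action):
    return {"tap wrong icon": "home screen",
            "tap settings icon": "settings open"}[action]


def stub_observe(screen, expected):
    return expected in screen


success, screens = run_task("open settings", stub_plan, stub_execute, stub_observe)
```

Keeping `history` as short text rather than raw screenshots reflects the role the abstract gives to the Pre-Understanding Module: the planner sees compact descriptions, so each replanning call costs fewer tokens.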
HEXGEN-TEXT2SQL: Optimizing LLM Inference Request Scheduling for Agentic Text-to-SQL Workflow
Abstract
arXiv:2505.05286v1 Announce Type: new Abstract: Recent advances in applying the agentic paradigm to large language models (LLMs) have significantly enhanced Text-to-SQL capabilities, enabling users without specialized database expertise to query data intuitively. However, deploying these agentic LLM-based Text-to-SQL systems in production poses substantial challenges due to their inherently multi-stage workflows, stringent latency constraints, and potentially heterogeneous GPU infrastructure in enterprise environments. Current LLM serving frameworks lack effective mechanisms for handling interdependent inference tasks, dynamic latency variability, and resource heterogeneity, leading to suboptimal performance and frequent service-level objective (SLO) violations. In this paper, we introduce HEXGEN-TEXT2SQL, a novel framework designed explicitly to schedule and execute agentic multi-stage LLM-based Text-to-SQL workflows on heterogeneous GPU clusters that handle multi-tenant end-to-end queries. HEXGEN-TEXT2SQL introduces a hierarchical scheduling approach combining global workload-balanced task dispatching and local adaptive urgency-guided prioritization, guided by a systematic analysis of agentic Text-to-SQL workflows. Additionally, we propose a lightweight simulation-based method for tuning critical scheduling hyperparameters, further enhancing robustness and adaptability. Our extensive evaluation on realistic Text-to-SQL benchmarks demonstrates that HEXGEN-TEXT2SQL significantly outperforms state-of-the-art LLM serving frameworks. Specifically, HEXGEN-TEXT2SQL reduces latency deadlines by up to 1.67× (average: 1.41×) and improves system throughput by up to 1.75× (average: 1.65×) compared to vLLM under diverse, realistic workload conditions. Our code is available at https://github.com/Relaxed-System-Lab/Hexgen-Flow.
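Summary
Recent advances in applying the agentic paradigm of large language models (LLMs) have significantly improved Text-to-SQL capabilities, allowing users without specialized database knowledge to query data intuitively. However, because agentic LLM-based Text-to-SQL systems inherently involve multi-stage workflows, strict latency constraints, and potentially heterogeneous GPU infrastructure in enterprise environments, deploying them in production is very challenging. Current LLM serving frameworks lack effective mechanisms for handling interdependent inference tasks, dynamic latency variation, and resource heterogeneity, which leads to poor performance and frequent service-level objective (SLO) violations. This paper presents HEXGEN-TEXT2SQL, a new framework designed specifically to schedule and execute agentic multi-stage LLM-based Text-to-SQL workflows on heterogeneous GPU clusters, and able to handle multi-tenant end-to-end queries. HEXGEN-TEXT2SQL introduces a hierarchical scheduling approach that combines global workload-balanced task dispatching with local adaptive urgency-guided prioritization, based on a systematic analysis of agentic Text-to-SQL workflows. In addition, we propose a lightweight simulation-based method for tuning key scheduling hyperparameters, which further improves robustness and adaptability. Extensive evaluation on realistic Text-to-SQL benchmarks shows that HEXGEN-TEXT2SQL significantly outperforms state-of-the-art LLM serving frameworks. Specifically, compared with vLLM under diverse realistic workloads, HEXGEN-TEXT2SQL shortens latency deadlines by up to 1.67× (1.41× on average) and raises system throughput by up to 1.75× (1.65× on average). Our code is available at https://github.com/Relaxed-System-Lab/Hexgen-Flow.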
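The two scheduling levels named in the abstract above, global workload-balanced dispatching and local urgency-guided prioritization, can be illustrated with a toy model. In this sketch a dispatcher sends each task to the worker with the shortest estimated completion time, and each worker serves its local queue in order of least slack (deadline minus estimated run time). The `Worker` class, the work units, and the slack formula are assumptions for illustration and are not taken from the HEXGEN-TEXT2SQL code.

```python
import heapq


class Worker:
    """A GPU instance with a relative speed and a local urgency-ordered queue."""

    def __init__(self, name, speed):
        self.name = name
        self.speed = speed        # relative throughput (heterogeneous hardware)
        self.pending_work = 0.0   # total queued work units
        self.queue = []           # heap of (slack, task_id, work)

    def estimated_finish(self, work):
        """Time to drain the current queue plus the new task."""
        return (self.pending_work + work) / self.speed

    def enqueue(self, task_id, work, deadline, now):
        slack = deadline - now - work / self.speed
        heapq.heappush(self.queue, (slack, task_id, work))
        self.pending_work += work

    def next_task(self):
        """Local step: run the task with the least slack first."""
        _, task_id, work = heapq.heappop(self.queue)
        self.pending_work -= work
        return task_id


def dispatch(workers, task_id, work, deadline, now=0.0):
    """Global step: pick the worker expected to finish the task soonest."""
    target = min(workers, key=lambda w: w.estimated_finish(work))
    target.enqueue(task_id, work, deadline, now)
    return target.name


fast, slow = Worker("fast", speed=2.0), Worker("slow", speed=1.0)
workers = [fast, slow]
placements = [
    dispatch(workers, "t1", work=4.0, deadline=10.0),
    dispatch(workers, "t2", work=2.0, deadline=5.0),
    dispatch(workers, "t3", work=2.0, deadline=4.0),
]
first_on_fast = fast.next_task()
```

Here `t3` arrives after `t1` on the fast worker but runs first because its deadline leaves less slack, which is the kind of reordering that helps multi-stage queries meet their SLOs.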
How Social is It? A Benchmark for LLMs' Capabilities in Multi-user Multi-turn Social Agent Tasks
Abstract
arXiv:2505.04628v1 Announce Type: cross Abstract: Expanding the application of large language models (LLMs) to societal life, rather than using them only as auxiliary assistants that communicate with one person at a time, requires LLMs to play roles independently in multi-user, multi-turn social agent tasks within complex social settings. However, this capability has not yet been systematically measured with available benchmarks. To address this gap, we first introduce an agent task leveling framework grounded in sociological principles. Concurrently, we propose a novel benchmark, How Social Is It (HSII below), designed to assess LLMs' social capabilities in comprehensive social agent tasks and to benchmark representative models. HSII comprises four stages: format parsing, target selection, target switching conversation, and stable conversation, which collectively evaluate the communication and task completion capabilities of LLMs on a dataset of realistic social interaction scenarios, HSII-Dataset. The dataset is derived step by step from a news dataset. We perform an ablation study by clustering the dataset. Additionally, we investigate the impact of the chain of thought (COT) method on LLMs' social performance. Since COT costs more computation, we further introduce a new statistical metric, COT-complexity, to quantify the efficiency of certain LLMs with COT on specific social tasks and to strike a better trade-off between correctness and efficiency. The results of our experiments demonstrate that our benchmark is well-suited for evaluating social skills in LLMs.
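Summary
Extending the use of large language models (LLMs) into societal life, beyond serving as auxiliary tools that interact with a single user, requires LLMs to be able to take on multi-user, multi-turn social agent tasks independently in complex social settings. However, benchmarks that systematically evaluate this capability are still lacking. To this end, we first propose an agent task leveling framework based on sociological principles, and we design a new benchmark named "How Social Is It" (HSII) to comprehensively assess the social capabilities of LLMs in social agent tasks and to benchmark representative models. HSII comprises four stages: format parsing, target selection, target switching conversation, and stable conversation. It evaluates the communication and task completion capabilities of LLMs using HSII-Dataset, a dataset of realistic social interaction scenarios built step by step from a news dataset. We carry out an ablation study by clustering the dataset, and we study the effect of the chain of thought (COT) method on improving the social performance of LLMs. Because COT consumes more computation, we further propose a new statistical metric, COT-complexity, to quantify how efficiently particular LLMs use COT on particular social tasks, and so achieve a better balance between correctness and efficiency in evaluation. Extensive experimental results show that our benchmark can effectively evaluate the social skills of LLMs.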